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1 Introduction 

In a previous report the design concepts of Charon [5] were presented. Charon is a toolkit 
that aids engineers in developing scientific programs for structured- grid applications to be 
run on MIMD parallel computers. It constitutes an augmentation of the general-purpose 
MPI-based [4] message-passing layer, and provides the user with a hierarchy of tools for 
rapid prototyping and validation of parallel programs, and subsequent piecemeal performance 
tuning. 

Here we describe the mplement at ion of the domain decomposition t ools used for creating 
data distributions across sets of processors. We also present the hierarchy of parallelizat ion 
tools that allows smooth translation of legacy code (or a serial design) into a parallel program. 
Along with the actual tool descriptions, we will present the considerations that led to the 
particular design choices Many of these are motivated by the requirement that Charon must 
be useful within the tra< itional computational environments of Fortran 77 and C. Only the 
Fort ran 77 synt ax will be presented in this report . 

2 A multi-level, orthogonal design 

In Charon we distinguish between data distribution support and parallel execution support. 
They are orthogonal elements of the design space, meaning that high-level data distribution 
functions can be applied in conjunction with low-level parallel execution support tools, and 
vice versa. Much of the trouble in the implementation of advanced algorithms on MIMD 
message-passing systems stems from the fact that data distribution and concurrent execu- 
tion on distributed data structures cannot be separated. Those systems that do allow the 
separation between the two are usually very restricted in both the type of operations and 
the type of distributions allowed, as was detailed in [5]. Charon is able to provide virtually 
complete freedom in data distribution while still offering powerful support tools for the con- 
trol of the program flow, because it treats parallelization as an incremental process, whose 
Pinal product must be very efficient, but whose intermediate stages are allowed to be slow. 
The foremost decision, made by the user, is the choice of the data distribution (possibly dy- 
namic). Whereas the lowest level of abstraction in Charon features close integration of data 


distribution and parallel execution environment, the highest level of abstraction contains 
support tools that allow the program to execute correctly on a distributed data set. wilh 
minimal structural and textual changes in the serial program text. This is accomplished as 
follows. 

All arraj's that need to be distributed register with a distribution utility that structures 
the local storage space for a segment (or segments) of the the array (see Section 3). All 
other variables are global, which means that they exist on all processors, and have the same 
value on all. Moreover, all processors execute the same program statements. Whenever 
an assignment to an element of a distributed array takes place, the owner-computes rule is 
invoked, which means that only one processor performs the actual assignment . Whenever an 
assignment requires the value of a distributed array element, a communication takes place 
automatically if the element is remote. The mechanism for such flexible assignments, which 
makes use of the Charon functions assign, address, and value, is described in Section 4. 
Complications may occur, for example when a user function is called that takes as input 
distributed array elements, and uses their addresses to modify nearby memory locations 
without providing explicit information regarding the position of those locations within the 
distributed array. In that case more versatile access mechanisms need to be applied that 
(may) fetch blocks of data, and that regulate execution of functions depending on which 
processor owns the data being modified. For this purpose the Charon functions invoke 
and mvalue are provided. Despite this flexibility, there are still certain program structures 
that cannot be expressed using the top-level functions in Charon. That is because those 
functions rely on atomicity of the owner-computes rule. Atomicity is automatically satisfied 
by a single assignment, but can be violated by user-defined or library functions operating on 
pointers. If a function modifies multiple distinct distributed arrays, or different parts of the 
same distributed array, there can be a conflict about ownership of the data to be modified. It 
is the responsibility of the user to resolve such conflicts, or to guarantee program correctness 
implicitly. 

3 Distribution support tools 

Charon supports the parallelization of programs using multi dimensional arrays related to 
structured grids. The data distribution process consists of three fundamental steps: 

1. Define a grid and create a partitioning using Cartesian sections. The result is called a 

partition. 

2. Assign partit ions to processors. The result is called a decomposition. 

3. Create the multi-dimensional, distributed array and associate it with a decomposition. 

The result is called a distributed grid variable, or distribution . 

Common decompositions, such as uni-partitions or diagonal multi-partitions (see [5, 6]). 
can be created with a few high-level decomposition functions. Customizations are performed 
using lower-level functions. In the following description of Charon functions, the integer 
variables grid, partition, decomposition, and distribution (in typewriter font.) are 
handles to the corresponding data structures. 
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create_grid, set-grid_size, and set^grid_start are used to define a discretization grid 
of a certain dimensionality, to specify the size of the grid in a particular dimension, and to 
redefine the starting inde.K of the grid in a specific dimension (default is specified through the 
Charon function set_def ault_of f set; see below), respectively. The grid and all subsequent 
constructs based on it are restricted to the processors in the MPT communicator specified in 
create^grid. 

Syntax of Fortran 77 grid creation functions. 

subroutine create_.grid(grid , communicator ,no_dimensions) 
integer communicator, grid, no_dimensions 

subroutine set _gr:.d_ size (grid , dimension , size) 
integer grid, dimension, size 

subrout me set_gr:.d_start (grid , dimension , start) 
integer grid, dimension, start 

Create_part it ion, setmo_cuts, set.cut, and set_even_cuts are used to define a par- 
tition, to specify the number of cuts in a certain dimension, to (re)define the value of a 
particular cut (a value of n means that a separator is placed between points n — 1 and 
n), and to space cuts in a certain dimension as evenly as possible, respectively. If no ex- 
actly uniform division is possible, set_even_cuts will augment the size of the low numbered 
partitions by one unit until the leftover points have been exhausted. 

Syntax of Fortran 77 partitioning functions: 

subroutine create.. part it ion (partition , grid) 
integer partition, grid 

subroutine set.no. .cuts (part it ion, dimens ion, no.cuts) 
integer partition, dimension, no.cuts 

subroutine set_cut (partitio , dimension, cut , value) 
integer partition, dimension, cut, value 

subroutine set.ev an. cuts (part it ion, dimens ion) 
integer partition, dimension 

create.decompositi on and set_owner are used to define a decomposition and to (re)set 
ownership of a particular partition, respectively. Ownership is signified by the rank of the 
processor within the communicator. Because grids can have different dimensionality, the 
number of indices needed to identify a partition can vary. It is the user’s responsibility to 
supply the correct number. 

Syntax of Fortran 77 decomposition functions: 

subrout ine create .decomposition (decomposition , part it ion) 
integer decomposition, partition 
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subroutine set__owner (decomposition, owner_rank, indexl , index2, . . .) 
integer decomposition, owner_rank, index2, index2, ... 

Note that Fortran 77 requires a fixed number of parameters for each function or sub- 
routine. The definition of set.owner and several other Charon functions does not conform 
to that standard, which may lead to problems on some computer systems. These can be 
resolved by using a Fortran 90 compiler instead, which will automatically insert default 
values for the dummy indices that are left unspecified. Whereas the set .owner routine suf- 
fices to construct any type of decomposition, it is usually preferable to create commonly 
used decompositions using a few high-level routines. It is simpler, less error-prone, and also 
more efficient; Charon can use optimized interrogation and communication calls when the 
decomposition has a known, regular structure. The common decompositions currently sup- 
ported by Charon are uni-partitions (each processor is assigned a single partition), diagonal 
multi-partitions (each processor is assigned several partitions in a regular pattern [6]), and 
High Performance Fortran-style [2] block-cyclic distributions. The predefined decomposi- 
tions assume partitioning of the grid in all dimensions. However, particular dimensions can 
be excluded by invoking the command exclude_partit ion_dimension. In multi-partition 
decompositions at least two grid dimensions must be partitioned. By default, t he uni- 
partition decomposition minimizes aspect ratios in the partitioned dimensions (for example, 
a grid of 80 x 20 points would be divided in 16 partitions of 10 x 10 points). Customiza- 
tion is obtained by specifying the number of processors in each particular grid dimension 
(set_no_processors). By default, the block-cyclic decomposition uses a cyclic distribu- 
tion with a group size of one in all partitioned dimensions. The number of processors to 
be applied to each partitioned grid dimension is as close to equal as possible. Different 
numbers of processors and different group sizes can be specified using set_no_processors 
and set_group_size, respectively. In addition, a block decomposition can be specified for 
selected dimensions using set_block_part it ion. When a decomposition is constructed us- 
ing high-level Charon functions, a partition is created implicitly. The function partition 
returns a handle to that partition. 

Syntax of Fortran 77 high-level decomposition functions; 

subrout ine creat e_unipart it ion (decompo s it ion , grid) 
integer decomposition, grid 

subroutine cr eat e_mult ipart it ion (decomposition, grid) 
integer decomposition, grid 

subroutine create_block_cyclic (decomposition, grid) 
integer decomposition, grid 

subroutine exclude_part it ion_dimens ion (decomposition, dimens ion) 
integer decomposition, dimension 

subroutine set _no_processors (decompo s it ion , dimens ion ,no_procs) 
integer decomposition, dimension, no.procs) 
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subroutine set _grc>up_size(decompos it ion , dimens ion, size) 
integer decomposition , dimension, size) 

subrout ine set_block_part it ion (decomposition , dimension) 
integer decomposition, dimension 

integer function part it ion(decomposition) 
integer decomposition 

Note that the numbering of all array elements requires the definition of an array ollset. For 
Fortran the default offset is 1. For C and C++ it is 0. Consequently, the first grid dimension 
has index 1 in Fortran, and index 0 in C. The default offset can be changed by calling 
set .default _of f set (offset) . 

In addition to creation and assignment functions, there also exist destruction and inter- 
rogation functions lor most of the above construct s. Where applicable, these are defined by 
replacing create with delete, or by leaving off set_, respectively. It is never required to 
delete a data structure when it. is no longer needed, but in extreme cases, when many calls 
are made to the creation functions, space may become tight. 

A quick way to customize a decomposition is to create one using a predefined high- 
level routine, and then modify it. For example, one may want to use a three-dimensional 
diagonal multi-partition scheme, but partitions owned by processor 2 should be transferred 
to processor 7: 

do kp = 1 , no_partitions(decomposition,3) 
do jp = 1, no_part itions(decomposition,2) 
do ip = 1 , no_partit ions (decomposition, 1) 
p = part ition_index(decomposit ion , ip , jp,kp) 
if (part ition.owner (decomposition ,p) . eq. 2) then 
call set. .part it ion_owner (decomposition ,7 ,p) 
end if 
end do 
end do 
end do 

Alternatively, when t tie particular location of the partition in the decomposition is irrel- 
evant, the triple loop can be written as a single canonical loop over all partitions in the 
decomposition: 

do p = 1, totaL_no_partitions(decomposition) 
if (part it ion_owner(decomposition ,p) .eq. 2) then 
call set .part it ion.owner (decomposition , 7 ,p) 
end if 
end do 

Execution of commit .decomposition is necessary when construction of a decomposition 
is finalized. It serves to check consistency of the definition of the decomposition and to 



compute some information concerning communication schedules. It is possible to modify 
some of the data structures underlying a decomposition that make the actual decomposition 
invalid. For example, if the number of cuts in a certain dimension of the grid is changed, the 
partition ownership schedule contained in the decomposition can no longer be valid. In that 
case commit-decomposition will fail and a new decomposition must be constructed. If the 
decomposition is found to be valid but to not qualify as one of the special predefined types 
anymore, the special internal attribute is reset. 

Once a decomposition has been formed, distributed variables can be associated with 
it. The function create_distribution creates a distributed array of some elementary 
data_type (MPI_integer, MPI_REAL, etc), whose storage is reserved at some specified starts 
address. The user also specifies the tensor rank of the variable, plus an array of numbers of 
components for eacli index of the rank. For example, setting rank equal to 2 and components 
equal to (3,3) defines a 3 x 3 matrix at each point of the grid. To accommodate stencil 
operations the user specifies a positive number of ghost_points. To determine whether 
enough space lias been allocated for the local portion of the distributed variable, the user 
can request the number of units of the elementary data type needed, using storage_space. 
Syntax of Fortran 77 distribution function: 

subrout ine create.dist ribut ion (distribution , decomposition , dat a_type , 

$ start_address , rank, components , ghost .points) 

integer distribution, decomposition, data.type, rank, 
components(*) , ghost_points 
<type> start .address (*) 

integer function storage. space(decomposition) 
integer decomposition 

Here <type> refers to a range of memory locations reserved for storage of elements of type 
data_type. By default, multiple partitions owned by the same processor are stored such that 
each partit ion takes up an equal amount of space. The layout is consistent with a storage 
declaration that allocates to each partition a subarray of identical dimensions. This will, in 
general, create gaps, which is wasteful. But it does allow uniform and simple declaration 
of complex distributed variables. Space can be conserved by calling the function compact, 
which eliminates any gaps, but necessitates the use of Charon access function offset (see 
Section 4) to det ermine where a particular partition starts in memory. Since every partition 
may have different subarray dimensions when compacted, suitable dimensioning statements 
may 7 require calls to part it ion_size. Complete control over memory allocation is got by 
specifying explicitly where each partition p starts in memory, and what the subarray 7 di- 
mensions are. Such specifications, or calls to compact, must be made before any 7 part of 
the distributed variable is used, because they affect important Charon functions, such as 
address, and value. 

Syntax of Fortran 77 layout functions: 

subroutine compact (distribution) 
integer distribution 
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subroutine set_of i set (distribution, location ,p) 
integer distribut i on , location, p 

subrout ine s et_ array _dimens ion (distribut ion , dimensions ,p) 
integer distribution, dimensions(*) , p 

Finally, we note that all grid, partition, decomposition and distribution creation and 
manipulation operations are global, which means that all processors in the corresponding 
communicator must call these routines with the same parameters. 

4 Execution support tools 

At the highest level of abstraction, Charon must present an interface that makes the transi- 
tion from a serial to a correct parallel implementation simple and straightforward. The user 
need not be concerned about details of the domain decomposition, local and remote data, 
concurrency, communication, etc. Effectively, the top level programming tools support the 
Charon data distributions (as do t he other levels), but hide the distribution aspects from 
the user. This is generally inefficient, but that is not a problem. In subsequent refinements 
performance improvements can be obtained, again making use of Charon tools. 

We note that Charon does not provide an automatic code conversion capability. All 
parallelization is carried out by the user, who retains complete control over data lay-out 
and program flow. Hen :e, the necessary code changes must be kept at a minimum. For 
that purpose Charon offers execution support tools that simulate a single data space and a 
single thread of control. Assignments and control structures are exact images of the serial 
program, and the resulting code is executed by all processors; Charon simulates a single, 
replicated program counter. We use the following rationale for the implementation. Each 
element of a distributee variable has a unique owner processor, so it is most natural— 
and often least expensive — to employ an owner- computes rule: whenever an element of a 
distributed variable occurs on a left-hand side of an assignment, the processor who owns 
it is responsible for its e valuation. But since all processors execute the same code within 
the same control structure, we have to provide a mechanism to skip assignments to nonlocal 
memory locations; the replicated program counters ‘pause 1 2 on all processors, except on the 
one executing the local assignment, and ‘resume 1 collectively immediately thereafter. The 
obvious way to implement (nested) loops over distributed data structures is as follows: 

1. Compute the intersection of the index sets of the loop and of the locally owned part of 
the distributed variable(s) on the left hand side of the assignment (s) in the loop body. 
This is the execution mask. 

2. Execute the statements covered by the mask on each processor independently. 

This would be the equivalent of a High Performance Fortran FORALL statement [2]. It is 
important to recognize that loops cannot be parallelized this way in general. Not only does 
it violate the principle ol a simulated single program counter, it also constitutes a reordering 
or splitting of the original loop, which cannot be done with impunity in case of recurrences or 
other dependencies within the loop body or across iterations. Note also that locally owned 
index sets in Charon can be arbitrarily complex, due to the flexibility of the partitioning 
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mechanism, and can generally not be described explicitly in the same loop structure as that 
of the serial program. 

We use the following approach instead. Each assignment in the serial program is trans- 
lated into a call to an assign routine, which takes as arguments a left hand side (an address) 
and a right hand side (a value). If the address is NULL (not reachable), no assignment takes 
place, and the statement is effectively not executed by the calling processor; it is masked. 
NULL addresses result from (unintentional) memory errors, and from assignments to nonlocal 
memory locations. The masking is obtained by using the function address, which returns an 
actual location for a local element of a distributed variable, and NULL for a nonlocal element. 
Masking alone is not sufficient, however, because the rigid hand side to be evaluated can also 
be arbitrarily complex, with possible contributions from local and remote memories. For this 
purpose the function value is introduced. It operates on distributed variables and always 
returns the proper value, or perhaps a ‘conventional’ segmentation fault in case of a pro- 
gramming error. No distinction is made between values returned by the function value and 
values of nondistributed variables and constants. All are rvalues in C terminology. Similarly, 
no distinction is made between addresses returned by the function address, and addresses 
of nondistributed variables. Both are lvalues in C terminology. In a correct program using 
only the high-level Charon tools, all rvalues always exist on each node, whereas lvalues are 
only defined if they are local. Alternatively, we may say that the highest level of abstraction 
of Charon only implements (implicitly invoked) remote gets, not puts. Implementation of 
remote gets does not require one-sided communication, since all processors call all assign 
routines and hence can cooperate in the assembly of each value to be assigned. Because 
rvalues must always be defined, value causes the processor that owns the particular data 
item to broadcast it to all other processors in the same communicator. 

The names of each of the functions assign, value, and address, and many others, can 
be changed by the user by editing a dictionary before installing Charon. Tins allows for 
terser code, at the possible expense of reduced readability. 

Syntax of (Fortran 77) global access functions (var constitutes a handle to a distributed 
variable): 

subroutine assign(my_address ,my_value) 

<type> my_address , my.value 

<type> function value(var, indexl , index2 , ...) 

integer var, indexl, index2, ... 

<type> function address (var , indexl , index2 , ...) 

integer var, indexl, index2, ... 

Observations: 

• Fortran 77 does not make explicit distinction between rvalues and lvalues in declara- 
tions, so the definition of assign appears symmetric in its arguments. In Fortran 90. 
my .address would be declared with intent (out), and my .value with intent (in). 
Also in Fortran 90, my .address can be declared as a symbolic name, or as a pointer 
variable. In C or C++, my_address is a pointer and my .value is an actual value. 
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• The assign operator is overloaded; it can take values and addresses of any elementary 
type, as long as they are consistent (that is, value and address must refer to the 
same type). The user is responsible for the match. Errors cannot be detected by the 
compiler or runtime system. 

• The generic functio i value specializes to real_value, integerjvalue, logical_value. 
double_precision_value and character_value, depending on the type of the dis- 
tributed variable whose handle (var) it receives. Similar specializations hold for the 
generic function address. In addition, value and address allow variable-length pa- 
rameter lists in order to accommodate distributed variables of differing dimensions. 
Integers indexl , index2 , ... are global indices with respect to the whole grid. Note 
that the result of the function address is used functionally as an lvalue by Charon, 
which is not possible in Fortran. 

For the reasons listed above, assign, value and address are all implemented in C. However, 
they are callable from Fortran. Correctness (i.e. serial consistency) of a program utilizing 
only these routines is easily shown, even though Charon makes no assumptions about lock- 
step execution or other synchronization features of the runtime system, and does not pose 
any restrictions on data dependencies in the program. Each invocation of assign requires 
(lie cooperation of the processor that owns a referenced remote data element. Because all 
processors execute the Fame code, any update of such referenced remote data occurring 
logically before the value is requested must already have taken place before the request is 
registered and satisfied; synchronization is performed automatically. This is equivalent to 
realigning t he replicat ed program counter. A side effect of the cooperative nature of implicit ly 
invoked communications is that they must be issued as broadcast operations. A processor 
executing t he value fund ion must take active part in sending data, but cannot know^ which 
processor is the recipier t until address has been evaluated. Both the address and the 
expression involving value are arguments of the assign routine, and Fort ran semantics say 
that they may be executed in arbitrary order. Hence, the rvalue may need to be evaluated 
before the destination address is known, which implies that the rvalue be available to oil 
processors in t he communicator. A broadcast is required. If the lvalue is not a distributed 
variable — in other word* , if it is a global variable — the address routine will not be used. 
Global variables are automatically self-consistent, because each processor assigns the same 
(broadcast ) value to its local copy 1 . 

The simplest optimization that can be performed is to by-pass execution of assign 
statements involving rerrote data elements, so that no broadcasts need to take place. Charon 
is notified of this by the bracketing statements begin-Iocal and endJ.ocal. 

As an example Charon application we transform a serial code fragment that computes a 
distributed variable pr on a two-dimensional grid and counts the number of times it drops 
below zero. 

count = 0 

do j = 1, nj 

~ 1 Temporary variables insi 1e loops arc also global variables, and assignments to them will invoke broadcast 

operations if the right hand side expression contains distributed variables. 
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do i = 1, ni 

pr(i , j ) = a(i , j ) **2 - 1.0/b(i,j) 
if (pr(i,j) .It. 0.0) count = count+1 
end do 
end do 

Here is the first parallelized version. Handles to distributed variables are indicated by an 
underscore (_) suffix. 

count = 0 

do j = 1 , n j 

do i = 1 , ni 

call assign(address(pr_ , i , j ) , 

$ value(a_ , i , j )**2 - 1 . 0/value(b_ , i , j) ) 

if ( value (pr_ , i ,j) .It. 0.0) count = count+1 
end do 
end do 

We use the generic names for the Charon access functions for brevity. Notice that the 
structure of the parallel loop nest is identical to that of the original version, and that we 
have not made any assumptions about how the grid has been partitioned. The above loop 
nest is completely serialized, with only one processor making assignments to the distributed 
variable pr at any one time. Notice also that all processors execute the conditional statement 
for every iteration, causing a potentially large number of remote-memory accesses 

In order to improve the performance of this loop while retaining the original structure, we 
make use of function point_owner to test whether a point in the decomposition (signified by 
the handle decomp) is assigned to the calling processor (identified by the variable my .rank. 
We only execute the loop body if the outcome is true. This means that only one processor 
will be incrementing the counter, but the final tally needs to be known to all processors. Since 
we now allow violation of the principle of a single program counter by prescribing different 
actions depending on the processor number, Charon can no longer automatically guarantee 
correctness of the program and some user intervention becomes necessary in the form of a 
reduction operation. The following version eliminates all extraneous synchronizations and 
redundant communications: 

counttmp = 0 

call begin_local (decomp) 
do j = 1 , n j 
do i = 1, ni 

if (point ^.owner (decomp, i ,j ) .eq. my_rank) then 
call assign(address_ (pr_ , i , j ) , 

$ value(a_ , i , j )**2 - 1 . 0/value(b_ , i , j ) ) 

if (value(pr_ , i , j ) .It. 0.0) counttmp - counttmp+1 
end if 
end do 
end do 
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call end_local (decomp) 

call MPI_allreduce(counttmp, count , 1 ,MPI_INTEGER,MPI_SUM, 

$ MPI_C0MM_ WORLD , ierr ) 

It should he noted that this version of the loop nest is still much more expensive than 
a hand-coded message-passing version. This is due to the fact that function calls are made 
during every iteration. Moreover, all processors do need to perform the ownership test for all 
elements of 1 tie iteration index space. The next optimization step is obtained by restricting 
the iterat ion index space \ priori. If we assume t hat each processor owns exactly one partit ion 
of the grid, the following code results: 

counttmp = 0 

call begin.local (decomp) 

do j = own_low(decomp , 2 , 1) , o wn_high (decomp , 2 , 1) 
do i = own_low(decomp , 1 , 1) , own_high (decomp, 1 , l) 
call assign(address(pr_ , i , j) , 

$ value(a_ ,i, j)**2 - 1 . 0/value(b_ , i , j)) 

if (value (pr_ ,i , j) .It. 0.0) counttmp = counttmp+1 
end do 
end do 

call end_local (decomp) 

call MPI_allreduce(counttmp , count , 1 ,MPI_INTEGER,MPI_SUM, 

$ MPI_C0MM_W0RLD , ierr) 

This code optimization exposes the domain decomposition. If the decomposition had con- 
sisted of an arbitrary number of grid part it ions per processor, then the above loop nest would 
have changed as follows: 

counttmp = 0 

call begin_local (decomp) 

do p = 1, own_no_part itions (decomp) 

do j = own_low(decomp , 2 ,p) , own_high(decomp, 2 ,p) 
do i = own..low(decomp , 1 ,p) , own_high (decomp , 1 ,p) 
call assign (address(pr_ , i ,j ) , 

$ value(a_ , i , j )**2 - 1 . 0/value(b_ , i , j ) ) 

if (value (p:r_ , i , j ) .It. 0.0) counttmp = counttmp+1 
end do 
end do 
end do 

call end_local (decomp) 

call MPI_allreduce(counttmp , count , 1 ,MPI_INTEGER,MPI_SUM, 

$ MPI_C0MM_WQRLD , ierr) 

Notice that point indexing is still global, i.e. with respect to the global grid. The price 
for this convenience is the calls to assign, address and value in the loop body. We now 
eliminate these and write the complete final loop as: 
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counttmp = 0 

do p = 1, own_no_partitions (decomp) 

do j = 1, own^part it ion_size(decomp,2 ,p) 
do i - 1, own.part ition_size(decomp, 1 ,p) 
pr(i,j,p) = a(i,j,p)**2 - 1.0/b(i,j ,p)) 
if (pr(i,j,p) .It. 0.0) counttmp = counttmp+1 
end do 
end do 
end do 

call MPI_allreduce(counttmp , count , 1 ,MPI_INTEGER,MPI_SUM, 

$ MPI _C0MM_W0RLD , i err ) 

Now there is no need anymore to place calls to the begin/endJLocal pair, because there 
are no more calls to the value routine. We have finally obtained a code fragment that is 
as efficient as a hand-coded message-passing version. It is also almost as complicated as 
a message-passing code, so it would appear that nothing has been gained. However, the 
important difference with other systems is that this optimized code may be freely combined 
with high-level unoptimized code fragments — even within the same subroutine — that do not 
contain any references to the domain decomposition. It is the programmer's responsibility 
to provide the proper declarations and dimension statements for the distributed variables. 
In the above example arrays pr, a, and b were declared using three indices. Rut it would 
have been equally legal to write the loop over the partitions owned by the calling processor 
as follows (assuming compact storage of the distributed arrays): 

counttmp = 0 

do p = 1, own_no_partitions(decomp) 
si_pr = of f set (pr_ ,p) 
si_a = offset(a_ ,p) 
si_b = offset(b_ ,p) 

ni_p = own_partition_size(decomp , 1 ,p) 
n j _p = own_part it ion_ size (decomp ,2 ,p) 

call sub_loop(pr(si_pr) ,a(si_a) ,b(si_b) ,ni_p,nj_p, counttmp) 
end do 

call MP I _ all reduce (count tmp , count , 1 ,MPI_ INTEGER ,MPI_SUM , 

$ MPI _CCM-L WORLD , i err ) 


subrout ine sub_loop (pr , a , b , ni , nj , counttmp) 
integer ni, nj , i, j, counttmp 
real pr (0 :ni+l ,0 :nj+l) , a(ni,nj), b(ni,nj) 
do j = 1, nj 
do i = 1, ni 

pr(i , j ) = a(i , j ) **2 - 1.0/b(i,j)) 
if (pr(i,j) .It. 0.0) counttmp = counttmp+1 
end do 
end do 
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return 

end 

Here we have assumed that arrays a and b are defined without any ghost point s, while pr has 
a border of size 1. The Charon functions local-part it ionosize and start-index are used 
to interrogate the layouts of the decomposition and the distributed arrays, respectively. For 
complete portability we may use the interrogation function no_ghost_point s (var Jhandle) 
to avoid implicit assumphons about ghost points. 

The above code fragment loops over the locally owned part itions in the canonical fashion, 
i.e. in the order in which Charon numbers the partit ions internally. If a particular order of 
visits by sub_loop were required, for example because synchronizations need to be performed 
after each layer of partitions in the j-direction lias been completed, we can use additional 
interrogation functions. We also make use of the function own_part it ion_index, which 
returns for a particular partition its sequence number on the calling processor if it is owned 
by that processor, or -1 otherwise. 

counttmp = 0 

do jp = 1, no_partitions(decomp, 2) 
do ip = 1 , no.part it ions (decomp , 1) 
p = own.part it ion_index (decomp, ip, jp) 
if (p .ge. 0) then 

si.pr = of f net (pr_ ,p) 
si_a = offnet(a_ ,p) 
si_b = offnet(b_ ,p) 

ni_p = own. .part it ion_ size (decomp , 1 ,p) 
nj -P = own..partition_size(decomp,2 ,p) 

call sub_loop (pr (si_pr) ,a(si_a) ,b(si_b) ,ni_p ,nj_p , counttmp) 
end if 
end do 

call MPI„barr ie::(MPI_CQMM_WORLD , ierr) 
end do 

call MPI_allreduce(counttmp , count , 1 ,MPI_INTEGER,MPI_SUM , 

$ MPI_C0MM_WQRLD , ierr) 


We now move to the more complex example of operations on arrays of matrix variables 
involving recurrences. The serial code represents the forward-elimination phase of a set of 
independent block-tridiagonal linear equations. Each of the diagonals low, main, and up 
consists of (4 x 4)-blocks. The right hand side vector r, which will be overwritten by the 
solution, consists of (4x l)-blocks. The recurrence is in the i-direction 

do j = 1 , n j 
do i = 1, ni-1 

call invert ^overwrite (up (1 , 1 , i , j ) ,main( 1 , 1 , i , j ) ) 

call vmvert .overwrite (r (1 , i , j ) ,main(l , 1 , i , j ) ) 

call multiply .add (main ( 1 , 1 , i+1 , j ) , low(l , 1 , i+1 ,j),up(l,l,i,j)) 
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call vmultiply_add(r(l , i+1 , j ) , low(l , 1 , i+1 , j ) ,r (1 , i , j ) ) 
end do 
end do 


subroutine invert_overwrite(mat 1 ,mat2) 
real matl(4,4), mat2(4 > 4), temp(4,4), pivot 
! code that overwrites matl by mat2' s {-l}*mat 1 
pivot - 1 .0/matl(l, 1) 

return 

end 

subroutine vinvert_overwrite(vec ,mat ) 
real vec(4) , mat(4,4), temp (4 , 4) , pivot 
! code that overwrites vec by mat2 A {-l }*vec 
pivot = 1.0/mat(l,l) 

return 

end 

subroutine mult iply_add (mat 1 ,mat2 ,mat3) 
real matl(4,4), mat2(4,4), mat3(4,4) 

integer n, m, k 

! code that overwrites matl by mat 1-mat 2*mat3 
do n = 1, 4 
do m = 1 , 4 

do k = 1 , 4 

matl(n,m) = matl (n,m) -mat2(n ,k)*mat3(k,m) 
end do 
end do 
end do 
return 
end 

subroutine vmult iply_add(vecl ,mat ,vec2) 
real vecl(4), vec2(4), mat(4,4) 

integer n, k 

! code that overwrites vecl by vecl-mat*vec2 
do n = 1 , 4 
do k = 1, 4 

vecl(n) - vecl (n) -mat (n,k)*vec2(k) 
end do 


end do 
return 
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end 


The difficulty with this code fragment is that the statements that do the actual work are 
in subroutines that, have no knowledge of the overall partitioned grid. They operate on 
addresses and neighboring memory locations that are passed through parameter lists. Hence, 
the strategy of translating every assignment into a call to the Charon assign subroutine 
does not work. The solution is to demand that (v) invert .overwrite and (v)mult iply_add 
execute atomically* which means that there must be a single, known address that acts as the 
start of a region of lvalue** on the same processor . No other values may be modified within the 
same subroutine. There nay be several contiguous regions of rvalues, whose sizes must also 
be known at run time. We now translate the above calling program; (v) invert .overwrite 
and (v)mult iply.add remain unchanged. 

do J = 1, nj 
do i = 1, ni-1 

call invoke ( invert .overwr it e , address (up. ,l,l,i,j),l, 

$ mvalue(16 ,main_ , 1 , 1 , i , j )) 

call invoke (v invert. overwr it e , address (r_ , 1 , i , j ) , 1 , 

$ mvalue(16 ,main. , 1 , 1 , i , j )) 

call invoke (mult iply.add , address (main. ,l,l,i+l,j),2, 

$ mvalue (16 , low. , 1 , 1 , i+1 , j) ,mvalue(16 ,up. , 1 , 1 , i , j )) 

call invoke (vmult iply.add , address (r. , 1 , i+1 , j ) , 2 , 

$ mvalue (16 , low. , 1 , 1 , i+1 , j ) ,mvalue(4 , r_ , 1 , i , j ) ) 

end do 
end do 

Syntax of Fortran 77 bulk access functions: 

subroutine invoke 'subf , my. address ,no_inbufs ,my_valuesl , my. values2 , . . .) 
external subf 
integer no.inbufs 

<type> my.address (*) , my.valuesl (*) , my_values2(*) , ... 

<type> function mvalue (n , var , indexl , index2 , index3, .... ) 
integer n, var, index 1, index2, index3, ... 

The general-purpose routine invoke examines the argument my.address and determines 
which processor is responsible for the execution of subroutine subf, based on the owner- 
computes rule. All other processors will skip the invocation of subf, but they will cooperate 
in providing rvalues, as needed, through communications. Again, atomicity is assumed, i.e. 
the processor that owns the first element of the distributed variable in the mvalue argument 
list is also responsible for supplying subsequent elements. If the first element is local, the 
action of mvalue is similar to that of value. If remote, a broadcast operation requesting 
n data elements is initiated. Upon completion, the function mvalue points to the start of 
a buffer region containing the requested values. The actual number of bytes transferred 
depends on the specific data type <type> of the distributed variable. The subroutine subf 
is defined by the user. It is invoked by Charon as follows: 
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call subf (my. address ,my_valuesl ,my_values2 , . . .) 

Fortran does not provide information about the number of actual parameters with which a 
subroutine is called, so the user must indicate to subroutine invoke the number of separate 
regions of input values through the parameter no_inbufs. If the user-defined subroutine 
also takes constants or non-distributed variables as arguments, a slight complication arises, 
because Fortran does not distinguish between scalar and array arguments of subroutines. 
It does not allow the type of overloading through argument checking that C++ does. In 
Fortran all arguments get translated into addresses, that are then passed to the subroutine. 
If the argument is an expression, it gets evaluated and the address of the result is passed to 
the subroutine (cf. value). But invoke expects for each argument not scalars, but pointers 
to arrays. To coerce invoke’s arguments into this behavior without using mvalue, which 
is reserved for distributed arrays, use the function gvalue. gvalue applies to both global 
scalars and global arrays. 

Notice again that in the translated code fragment no influence of the domain decomposi- 
tion is visible. The situation changes when we start to optimize the code. First assume that 
the grid is partitioned in a stripwise fashion, such that all points on a grid line of constant j 
are within the same partition. The block-tridiagonal systems can be solved independently by 
all processors, without any communication. Hence, the first optimization is again obtained 
by using the begin/endJLocal pair. Skipping a few steps, we easily arrive at the following 
efficient code: 

do j = 1, own_partition_size(decomp,2,l) 
do 1 = 1 , ni-1 

call invert. overwrite (up (1 , 1 , i , j ) ,main(l , 1 , i , j ) ) 
call vinvert .overwrit e(r(l , i , j ) } main(l , 1 , i , j ) ) 
call mult iply. add (main (1 , 1 , i+1 , j ) , low(l , 1, i+1, j), up (1,1,1, j» 
call vmultiply_add(r(l , i+1 , j) ,low(l , 1 , i+1 , j ) ,r(l,i, j)) 
end do 
end do 

The problem becomes more interesting when the grid is partitioned differently, for example 
because there are other conflicting recurrences in the program. Assume again that each 
processor owns one partition, but this time the partition does not stretch the whole width 
of the grid. We first force the solution algorithm to proceed along stripwise sections of the 
grid, but retain the convenience of the implicitly invoked remote gets. 

do ip = 1, no.part it ions (decomp, l) 
do jp = 1, no.partit ions(decomp ,2) 
p = part it ion. index (decomp , ip , jp) 
do j = low(decomp , 2 ,p) , high(decomp,2 ,p) 

do i = low(decomp, 1 ,p) , min(ni-l ,high(decomp, 1 ,p) ) 
call invoke ( invert. overwr it e , address (up. ,l,l,i,j),l, 

$ mvalue ( 16 , main. , 1 , 1 , i , j ) ) 

call invoke(vinvert. overwrite, address(r. , 1 , i , j ) , 1 , 

$ mvalue ( 16 , main. , 1 , 1 , i , j ) ) 


16 
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call invoice (mult iply.add, address (main. ,1,1 , i+1 , j) ,2, 

mvalue ( 16 , low. , 1 , 1 , i+ 1 , j ) , mvalue ( 1 6 ,up_ , 1 , 1 , i , j ) ) 
call invoke (vmultiply. add, address (r. , 1 , i+1 , j) ,2 , 

$ mvalue (16 , low_ , 1 , 1 , l+l , j ) , mvalue (4, r. , 1 , i , j ) ) 

end do 
end do 
end do 
end do 

Observe that the original loop has been reordered, but that the recurrence relation is re- 
spected. Next, the need for frequent communications must be eliminated. Since the recur- 
rence relation lias a dep h of only one, a border of ghost points of size one suffices for the 
distributed arrays. Copying ghost point values from neighboring partitions is accomplished 
by using function copyJaces. 

Before the loop is enlered, all ghost point values are initialized. During the execution of 
the loop nest a complication arises, because the set of four bulk assignment statements in 
the loop body straddles the boundaries of partitions. Whenever the assignments ‘spill over 
into the next partition, it would be preferable to write ghost point values, rather than move 
partition ownership to another processor to do the spilled-over assignment. The mechanism 
for causing t his t o happen is the bracketing pair begin/end_ghost_access (p) , which forces 
ghost point values of partition p to be written, instead of values of points properly contained 
in other partitions (provided the ghost points exist). In addition, ghost point values are 
read, instead of requested through a communication or a local memory copy, again provided 
the ghost points exist . It is as if temporarily the partition owned by the calling processor 
has been enlarged by tie ghost points As before, exactly one processor is responsible for 
performing the computation of each distributed array element, regardless of whether the 
begin/end_ghost_acces s calls are placed. After all ghost points have been written for a 
whole strip of partitions, the values are transferred in bulk to the neighboring processors by 
calling the function copy^ghost _f aces. 

Syntax of Fortran 77 face copy functions: 

subroutine copy.f aces(var, periodicity , thickness , dimension , cut , subset) 
integer var, periodicity, thickness, dimension, cut, subset (2,*) 

subroutine copy.g no st .faces (var , period icity , thickness , dimens ion, 

cut , subset) 

integer var, periodicity, thickness, dimension, cut, subset(2,*) 

subroutine copy.f aces. all (var , per iodic ity , thickness , stencil. shape) 
integer var, periodicity, thickness, stencil. shape 

subroutine copy.gtiost.f aces. all (var , per iodi city , thickness , stencil, shape) 
integer var, periodicity, thickness, stencil. shape 

Copy_f aces copies values from the outermost ‘layers' of partitions to the corresponding ghost 
points of adjacent partit ions, thickness refers to the number of layers to be copied (smaller 
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Ilian or equal to the total thickness of the borders). If thickness is ALL, all layers are copied. 
If dimension equals p, copying takes place to neighboring partitions in the p th coordinate 
direction, where p can be positive or negative. If copying is required in a single dimension 
but in both directions, write p+ALL. The parameter cut specifies the sequence number of the 
cut in the coordinate direction defined by dimension across which the copy operation is to 
take place. Setting cut equal to ALL selects all cuts simultaneously. If the copy operation is 
PERIODIC, the sequence number of the cut may be zero, or one more than the actual number 
of cuts present (in Fortran, where the default array starting index is 1). Either case will 
be interpreted as a periodic copy. If the operation is NONPERIODIC, such cuts are ignored. 
Additionally, we may restrict the copying to a subset of a particular face, indicated by the 
two-dimensional array subset. It lists, for each coordinate direction other than that normal 
to the face to be copied, the start and end coordinates of the points of the subset with respect 
to the global grid. Alternatively, use ALL for the starting index if all points along the cut 
are to be copied. 

Often it will be useful to copy face values at all cuts in all dimensions and directions, 
especially in the case of explicit methods, where all off-processor information can be fetched 
beforehand. For this purpose the variation copy _faces_all is provided. It takes as an 
argument the stencil shape, which can have the values BOX or STAR (see [1]). The STAR shape 
ignores diagonal values and only copies between strongly connected partitions. BOX, which 
also copies values to weakly connected partitions, will result in a staged copying of data 
to reduce latency. Figure 1 shows several different, applications of the copy_f aces [_all] 
routines. 

Copy^ghost Jaces accomplishes the same as copy_faces, but copies values from ghost 
points to points properly owned by neighboring partitions. 

Notice that both types of copy functions are data-parallel. All processors participate, in 
principle, but only operate on the data that they own (if a processor is responsible for neither 
sending nor receiving, it can safely skip the copy call). In fact, the result of copy operations 
is independent of the number of processors involved, but depends only on the number and 
lay-out of the partitions. This is true for all Charon operations defined so far; distributed 
programs can be simulated, tested and debugged, to a large extent, while using only a single 
processor. One may even use Charon exclusively to obtain blocked uniprocessor code that 
optimizes data locality. Hence, the following program will run correctly on any number of 
processors. 

call copy_f aces (r_ , NONPERIODIC ,1,-1 , ALL, ALL) 
call copy_f aces (low. , NONPERIODIC , 1 , -1 , ALL, ALL) 
call copy_f aces (main_ , NONPERIODIC ,1,-1, ALL, ALL) 
do ip = 1 , no_partitions (decomp, 1) 
do jp = 1, no_partitions(decomp,2) 
p = part it ion_index(decomp , ip , jp) 
call begin_ghost_access (decomp, p) 
do j = low(decomp ,2 ,p) , high(decomp,2,p) 

do i = low(decomp, 1 ,p) , min(ni-l , high ( dec omp, 1 ,p) ) 
call invoke ( invert _overwr it e , address (up_ ,l,l,i,j),l, 

$ mvalue(16 ,main_ , 1 , 1 ,i , j ) ) 
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Figure 1: Applications of copyjfaces and copyJ:aces_all to a distributed array. 


call invoke (vinvert. overwrite, address (r_ , 1 ,i , j ) , 1 , 

$ mvalue ( 16 ,main_ , 1 , 1 ,i , j ) ) 

call invoke (mult iply_add, address (main. , 1 , 1 , i+1 , j ) ,2, 

$ mvalue ( 16 , low_ , 1 , 1 , i+1 , j ) , mvalue ( 16 ,up_ , 1 , 1 , i , j ) ) 

call invoke (vmult iply_ add , address (r_ , 1 , i+1 , j ) , 2 , 

$ mvalue (16 , low_ , 1 , 1 , i+1 , j ) , mvalue (4 , r_ , 1 , i , j ) ) 

end do 
end do 

call end_ghos:_access (decomp ,p) 
end do 

call copy_ghost._f aces (main_ , NONPERIODIC , 1,1, ip, ALL) 
call copy_ghost..f aces(r_ , NONPERIODIC , 1 , 1 , ip , ALL) 
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end do 


The above loop structure no longer requires the implicitly invoked communications. Con- 
sequently, we can relax the principle of the simulated single program counter and let each 
processor execute only its own part of the loop. But a poor load balance obtains, because 
only those processors who own partitions in the same strip of the grid can be active at the 
same time: 

call copy.f aces (r_ , NONPERIODIC , 1 , -1 , ALL , ALL) 
call copy.f aces (low. , NONPERIODIC , 1 , -1 , ALL, ALL) 
call copy.f aces (main. , NONPERIODIC ,1,-1 , ALL, ALL) 
call begin.local (decomp) 
do ip - 1, no.part it ions (decomp, 1) 
do jp = 1, no.partitions(decomp,2) 
p = part ition_index(decomp , ip, jp) 
if (part ition.owner (decomp ,p) .eq. my.rank) then 
call begin. ghost. access (decomp ,p) 
do j = low(decomp,2 ,p) , high (decomp, 2, p) 

do i = low(decomp, 1 ,p) , min(ni-l ,high(decomp , 1 ,p) ) 
call invoke (invert .overwrite , address (up_ ,l,l,i,j),l, 

$ mvalue (16 ,main_ , 1 , 1 , i , j ) ) 

call invoke (v invert .overwr it e , address (r. , 1 , i , j ) , 1 , 

$ mvalue (16, main., l,l,i, j)) 

call invoke (multiply. add, address (main., 1,1 ,i+l, j) ,2, 

$ mvalue ( 16 , low. ,1,1, i+1 , j ) ,mvalue(l6 ,up_ , 1 , 1 , i , j)) 

call invoke (vmultiply. add, address (r_ , 1 , i+1 , j ) ,2 , 

$ mvalue (16 , low. , 1 , 1 , i+1 , j ) , mvalue (4, r_ , 1 , i , j ) ) 

end do 
end do 

call end.ghost.access (decomp ,p) 
end if 
end do 

call copy. ghost.f aces(main. , NONPERIODIC ,1,1, ip, ALL) 
call copy.ghost.f aces(r. , NONPERIODIC, 1,1, ip, ALL) 
end do 

call end.local (decomp) 

4.1 Multi-partition version of tri-diagonal solver 

The load balance of the above loop nest can be improved by selecting another domain 
decomposition. If each processor receives not one but several partitions of the grid, arranged 
according to the diagonal multi-partition scheme [5, 6], the loop automatically becomes load 
balanced. Eliminating the invoke references, we obtain the final version of the code: 

call copy.f aces (r. , NONPERIODIC , 1 , -1 , ALL , ALL) 
call copy.f aces(low_ , NONPERIODIC, 1 , -1 , ALL, ALL) 


20 



call copy.f aces (ms in. , NONPERIODIC , 1 , -1 , ALL, ALL) 
do ip = 1, no.partitions (decomp, 1) 
do jp = 1, nc.partitions(decomp,2) 
p = own_part it ion_ index (decomp, ip, jp) 
if (p . gt. 0) then 

do j = 1, own.partition_size(decomp,2 ,p) 
ihigh = o*;n.partition_size(decomp , 1 ,p) 
if (ip . ec . no.partitions(decomp, 1)) ihigh = ihigh-1 
do i = 1, ihigh 

call insert .overwrite (up (1 , 1 , i , j ,p) ,main(l , 1 , i , j ,p)) 
call vir. vert .overwrit e(r(l , i , j ,p) ,main(l , 1 , i , j ,p) ) 
call mul tip ly_ add (main ( 1 , 1 , i+1 > j ,p) ,low(l , 1 , i+1 , j ,p) , 

$ up(l , 1 , i , j ,p) ) 

call vmi ltiply.add(r (1 , i+1 , j ,p) ,low(l , 1 , i+1 , j ,p) , r (1 , i , j ,p) ) 
end do 
end do 
end if 
end do 

call copy. ghost, f aces(mam. , NONPERIODIC , 1,1, ip, ALL) 
call copy.ghost..f aces (r. , NONPERIODIC ,1,1, ip, ALL) 
end do 

4.2 Pipelined uni-partition version of tri-diagonal solver 

If multi-partitioning is not feasible, performance of the loop nest can still be improved by 
pipelining the uni-partit on solver. We will assume for simplicity that the number of grid 
points in the j-direction is divided evenly among the processors, and that the size of each 
partition in the j-direction is divisible by the pipeline grouping factor npipe. Otherwise, 
some preconditioning would be necessary. 

call copy.f aces (r.. , NONPERIODIC ,1,-1, ALL , ALL) 
call copy.f aces (low. , NONPERIODIC, 1 , -1 , ALL , ALL) 
call copy.f aces(main. , NONPERIODIC ,1,-1 , ALL, ALL) 
call begin.local (decomp) 
do ip = 1 , no.partitions (decomp , 1) 
do jp = 1, no.partitions(decomp,2) 
p = part ition .index(decomp , ip , jp) 

no. stages - (high (decomp ,2 ,p) -low(decomp , 2 ,p)+l)/npipe 
do stage = 1, no. stages 

subset (1,1) = low (decomp , 2 ,p) + (stage-1) *npipe 
subset (2,1) = subset ( 1 , 1) -1+npipe 
if (part ition_owner(decomp,p) .eq. my. rank) then 
call begin.gho st .access (decomp ,p) 
do j = subset(l,l), subset(2,l) 

do i = low(decomp , 1 ,p) , min(ni-l ,high(decomp , 1 ,p) ) 
call invoke (invert. overwrite, address(up. , 1 , 1 , i , j ) , 1 , 
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$ mvalue (16 ,main_ , 1 , 1 , i , j ) ) 

call invoke (vinvert ..overwrite, address (r. , 1 , i , j ) , 1 , 

$ mvalue (16 ,main_ , 1 , 1 , i , j ) ) 

call invoke (mult iply.add, address (main. ,1,1, i+1 , j ) ,2 , 

$ mvalue (16, low. ,1,1, i+1 , j) , mvalue ( 16 , up. , 1 , 1 , i , j ) ) 

call invoke (vmultiply.add, address (r. , 1 ,i+l , j) ,2, 

$ mvalue ( 16 , low. , 1 , 1 , i+1 , j ) , mvalue (4, r. , 1 , i , j)) 

end do 
end do 

call end_ghost.access(decomp ,p) 
end if 

call copy.ghost.f aces (main. , NONPERIODIC, 1 , 1 , ip , subset) 
call copy.ghost.f aces (r. , NONPERIODIC , 1 , 1 , ip , subset) 
end do 
end do 
end do 

call end.local (decomp) 

The above loop deviates from what would usually be obtained by programming a pipelined 
equation solver from scratch. Most significantly, there is only one copy of the main loop body; 
no special cases have to be distinguished for processors at the begin or end of the pipeline. 
Moreover, synchronization appears in a natural place, namely after the program text that 
finishes a stage of the pipeline. This is a fringe benefit of the coding style encouraged by 
Charon. The user is led to approach the numerical problem at hand from the perspective of 
the overall data space and instruction stream, not just from the subset owned by each indi- 
vidual processor. Equally important is the Charon concept of data-parallel communications. 
All processors call each communication routine, in principle, but action and/or synchroniza- 
tion is required only by those processors that own or need the data that is communicated. 

Finally, we revert to accessing the distributed-array elements directly. This eliminates 
the overhead of the many function calls to invoke, address, and mvalue. In addition, we 
rearrange the order in which the partitions are visited. Pipelining inhibits copying interface 
data along entire cuts, and copying only a few numbers at a time can lead to a significant 
overhead if all processors must execute each instance of copy_ghost -faces. So instead, we 
write the loop nest as a set of independent pipelines, one for each strip of partitions in the 
i-direction. To accomplish this we make use of the function own.part it ion.coordinate, 
which returns for the p th partition owned by the calling processor the partition index in 
a specified coordinate direction (i.e. the sequence number of the strip of partitions). The 
copy operation after completion of each pipeline segment will finish correctly and without 
deadlock, because all processors that are involved in the communication are in the same 
strip, and hence call the copy routine. 

call copy.f aces (r_ , NONPERIODIC ,1,-1 , ALL , ALL) 
call copy.f aces(low. , NONPERIODIC, 1 , -1 , ALL, ALL) 
call copy.f aces (main. , NONPERIODIC, 1 , -1 , ALL, ALL) 
do jp = 1, no. partitions (decomp, 2) 
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if (own_partition_coordinate(decomp, 1,2) .eq. jp) then 
do ip = 1, no^part it ions (decomp, 1) 
p = part it ion_index(decomp , ip, jp) 

no. stages = (high (decomp , 2 ,p) -low (decomp , 2 ,p)+l) /npipe 
do stage = 1, no_stages 

subset (l,:i) = low(decomp,2,p)+(stage-l)*npipe 
subset(2,:i) = subset (1 , l)-l+npipe 
if (part it ion_owner (decomp, p) .eq. my.rank) then 
do j = l+(stage-l)*npipe, stage*npipe 
ihigh = partition_size(decomp,l ,p) 

if (ip .eq. no_part it ions (decomp , 1) ) ihigh = ihigh-1 
do i = 1 , ihigh 

call invert_overwrite(up(l , 1 , i , j ) ,main(l , 1 , i , j ) ) 
call v invert _overwr it e (r (1 , i , j ) ,main(l , 1 , i , j ) ) 
call mult iply_ add (main (1 , 1 , i+1 , j ) ,low(l , 1 , i+1 , j ) ,up(l,l,i,j)) 
call vmult iply_add(r ( 1 , i+1 , j ) , low(l , 1 , i+1 , j ) ,r(l,i,j)) 
end do 
end do 
end if 

call copy..ghost_f aces(main_ , NONPERIODIC , 1 , 1 , ip , subset) 
call copy. _ghost_f aces (r_ , NONPERIODIC , 1 , 1 , ip , subset) 
end do 
end do 
end if 
end do 

4.3 Additional solver optimizations 

Although the final multi partition and pipelined uni-partition codes are quite efficient, there 
are some other optimizations that, users might consider. First, there is no strict necessity to 
use ghost points for this application. They are used to satisfy the remote data requirements 
of the solver implicitly hy duplicating such data in the location where they are expected by 
the calling processor. But explicit message passing could also have been used (the lowest level 
of abstraction in Charon >, in which send and receive buffers are managed by the user. Buffer 
data can then be integrited in the computation directly as it arrives, without first being 
stored in the ghost, point locations. This is not necessarily faster than using ghost points, but 
it saves space. Second, a certain overhead is incurred by letting processors perform a loop over 
partitions they do not own, instead of focusing on those they do own. Since no computational 
work is done on nonlocal partitions, this overhead is small, except in the case of extremely 
fine partitioning. Then the multi-partition method can be further improved by computing in 
advance which part ition in each strip of part itions is owned by the calling processor Third, 
if explicit message passing were used, some message aggregation would be possible by fusing 
co- located calls to the copy routines. Because Charon provides the mechanism to construct 
coarse-grained parallel programs with relative ease, latency reduction through such fusings 
is generally minimal. These optimizations should only be considered if absolutely required, 
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since they increase the programming complexity significantly. 

A potentially more useful optimization is obtained by using asynchronous communication 
calls icopy_f aces, icopyjf aces_all, etc. Such calls return immediately without blocking 
on the sending or receiving side. A copy_wait call must be issued at the point in the program 
where the expected data is actually used. 

4.4 Data redistribution 

Certain applications feature a succession of different and very strong data dependencies 
during the course of the computation. Such programs may benefit from a dynamic redis- 
tribution of part of their data to reduce the frequency of remote-memory accesses. This is 
accomplished by the routine redistribute. In most cases redistribution results in a global 
exchange of data, which is an expensive operation that should be applied with care. If pos- 
sible, the asynchronous version iredistribute should be used. The routine assumes that 
both the source and the destination distributed variables have been predefined. Obviously, 
redistribution can only take place between distributed variables of the same tensor rank and 
vector dimensions, defined on the same grid. Increased efficiency is obtained for variables 
that are also based on the same partition, since this reduces the fragmentation of the packet s 
of data to be communicated between processors. If space for the two distributions (partially) 
coincides, Charon treats the redistribution as an in situ operation, which is usually more 
expensive than in the case of spatially disjoint copies. 

Syntax of Fortran 77 redistribution function: 

subroutine redistribute (to. var ,f rom.var) 
integer to.var, from.var 

5 Charon utilities 

The previous sections described the basic Charon functions that can be used to write parallel 
programs in a piecemeal fashion. However, for the conversion of legacy codes some more 
utilities are needed. Common practice in the writing of scientific computing codes includes 
overindexing, and the introduction of lower-dimensional array segments. Overindexing is 
especially useful on vector computers, where it serves to increase the vector length of inner 
loops. Lower- dimensional array segment s are used as scratch space, or to provide a convenient 
local reference to a higher-dimensional array. We will show how to implement these three 
techniques using Charon. 

5.1 Specialized and generalized distributions 

Lower-dimensional segments of distributed arrays are called slices . They are defined using 
the command create_slice. Since they use the same space as the arrays from which they 
are ‘carved 1 , no pointers to memory locations are necessary. The syntax of the command is 
as follows: 

subroutine cr eat e_ si ice (var_ si ice ,var ,no_dims, indices) 
integer var.slice, var, no.dims, indices(*) 
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The parameter no_dims ndicates that dimensions 1 through no_diins ol the original dis- 
tributed array (signified by var) are retained. The array indices contains an ordered list 
of dropped — and hence constant — indices of the parent array. If the parameter var_slice 
refers to an existing, valid slice, that old slice definition is deleted and overwritten by the new 
one. This is consistent with the programming practice of sweeping over higher-dimensional 
arrays in a slicewise fash on. Depending on the size and the number of operations on the 
slice, calls to create_slice may represent a significant overhead. Higher efficiency can be 
achieved by creating and storing the slices only once in a preprocessing step. All operations 
on partitions (other than those that change the layout) can also be applied to slices. A 
slice will generally consist of a number of segments of partitions contained in t he parent dis- 
tributed array. These segments inherit their local sequence number from the corresponding 
partition. 

Arrays of lower dimension than that of the grid that are used as scratch space are defined 
using a generalization of the create_distribut ion routine. This results, as in the case of 
create_slice. in the creation of a distribution. 

subrout ine create_generalized_distribut ion(var decomposition , data.type , 

$ start .address , rank , components , ghost .points , index , 

$ leading .posit ion) 

integer var, decomposition, data.type, rank, component s (*) , ghost. points , 
$ index(*) , leading.posit ion 

<type> start _addi ess (*) 

The value of index (i) ir dicat es which index of the subarray corresponds to grid dimension 
i. If the value is negative , say -p, then that grid dimension is excluded from the generalized 
distribution, and its index is fixed at p. create_generalizedjdistribut ion can also be 
used to create plain distributions by setting index(i) equal to i for all dimensions of the 
grid. Permutations of grid indices are obtained by making index a proper permutation of 
the grid dimensions. Fir ally, the user can choose the relative positions of grid and tensor 
indices. By setting the parameter leading_posit ion to TENSOR — the default for the stan- 
dard create_distribut:ion routine — the tensor indices are the fastest varying components 
of the distributed variable (in Fortran). Choosing GRID makes the grid indices var} 7 fastest . 

The technique of overindexing is the most complicated to accommodate, because it relies 
heavily on the implicitly defined storage format of Fortran arrays. It can also be relatively 
expensive when applied carelessly to loops that run over multiple partitions, and should 
generally be avoided in parallel programs. But since it is widely practiced in traditional 
scientific computing, some support for it must be provided. The Charon answer is to fuse 
a number (no_dims) of array indices of a distribution, thus creating an alias for the original 
distribution. The result igain obeys all the rules for distributions. 

subroutine create. .f used.distribut ion(var_alias , var ,no_dims) 

integer var.alias , var, no.dims 

As an example, if two dimensions (the first two) of a four-dimensional distributed array 
are fused, the variable is henceforth indexed using only three indices. Note that contiguous 
array elements in the seual code may not be contiguous anymore in the distributed array. 
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even when they are within the same partition, due to the existence of ghost points. The 
Charon functions value and address will take proper care of this, and the results will be as 
expected, as if there were no ghost points. But the step from the high-level Charon code to 
the lower-level concurrent code with direct array access may no longer be as straightforward 
as before. 

5.2 Irregular remote data requests 

So far we have only discussed the facilitation of regular remote array accesses through the 
Charon copy_f aces routines. But sometimes it. is necessary to make reference to remote data 
in an arbitrary pattern. For this purpose Charon provides a device called copy.tile, which 
places in the memory of particular processors a copy of a Cartesian subset of a dist ributed 
variable. 

subrout ine copy _t i le (var , comm , start .address , subset ) 
integer var, comm, subset (2,*) 

<type> start_address(*) 

As all Charon communication routines, copy_tile is called by all processors in the communi- 
cator for which the distributed variable is defined. If the calling processor is a member of com- 
municator comm, then a copy of the Cartesian subset defined by n”=i [.s?//xse/,(l , i), subset. (2, *)] 
is placed in a buffer at start.address. An asynchronous, nonblocking version is also avail- 
able under the name of icopy_tile. 

There are certain similarities between copy.tile and the remote data requests in the 
Global Arrays package [3], as both mechanisms allow subsets of global data to be gathered. 
However, in the case of Global Arrays, such subsets are restricted to two-dimensional subsets 
of matrices, whereas in Charon subsets of arbit rary dimension are allowed. Another difference 
is that in Charon copying is a collective operation, and no explicit synchronization is required, 
except in the case of asynchronous transfers. By contrast, Global Arrays uses one-sided 
communication, which always requires synchronization. 

5.3 Parallel I/O 

Using the newly defined parallel I/O subset of MPI 2 [7], distributed variables can be written 
to a file in a single step, as was demonstrated in the multi-partition parallel flow solver 
RANS-MP [6]. Syntax: To be determined. 

6 Conclusions 

A practical design for a system to accommodate piecemeal conversion of serial legacy codes 
or designs to efficient distributed-memory message-passing programs has been presented. 
Although Charon targets complicated structured-grid applications, its principle extends to 
other areas as well. What is required is a set of functions that gives the user global access to 
elements of distributed data structures, in the vein of the Charon structured-grid functions 
value and address, supplemented with a set of functions to create and manipulate such 
data structures. The latter, as well as suitable data-parallel communication functions, are 
likely to be specific to the application area. 


26 


Whereas Charon does not itself comprise an automatic parallelization facility, it is ame- 
nable to the use of such tools. In principle, all the user need do is determine the distribut ions 
of large arrays. Using these as inputs, it can be inferred automatically where and how to 
place calls to assign, invoke, address, value, and mvalue. Subsequent tuning is then 
carried out by the programmer. 

Finally, we summarize the feat ures that distinguish Charon from systems that are based 
exclusively on (semi-) automatic program translation. Charon: 

• gives the user complete control over data distribution, including advanced schedules 
such as multi-partitioning. 

• gives the user complete control over memory usage and data layout within each pro 
cessor, including coincident arrays and hand-tuned padding. 

• gives the user complete control over granularity of the program through the flexible 
communication specificat ion mechanisms of copyjfaces/"tile. 

• allows redistribution of distributed arrays. 

• allows easy incremental program change; no new analysis is needed when modules or 
statements are added, and previous tunings are never lost. 

• poses no restrictions on programming model; although the Charon tools facilitate 
SPMD-style programming, use of different communicators provides the flexibility to 
create different contexts, and message-passing calls can always be used to support true 
MIMD programs. 

The disadvantage of this flexibility is that hand-coding is involved, and that some knowledge 
of the program structure and data dependencies is required. 
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